Building Spatial Statistical Models in R Using spmodel

EFI Statistical Methods Seminar Series

Michael Dumelle

USEPA

2026-01-12

About Me

  • Statistics Ph.D. in 2020 from Oregon State University
  • Research Interests: Spatial statistics, ecology, software development
  • CRAN software maintained: spmodel, SSN2, spsurvey
  • Lead statistician for USEPA’s National Aquatic Resource Surveys

The NARS

The National Aquatic Resource Surveys (NARS) are a collaboration between USEPA, states, and tribes to assess the quality of the nation’s aquatic resources using a statistical survey design

Roadmap

A tentative roadmap

  • The big picture
  • Some technical background
  • What is spatial covariance?
  • What are spatial linear models?

If time

  • Applications to large data sets
  • Spatial generalized linear models

The Big Picture

The Linear Model

Simple and multiple linear regression models (i.e., linear models) are important ecological tools

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]

  • \(y\) is the response variable

    • dependent, outcome variable
  • \(x\)’s are explanatory variables

    • covariates, predictors, independent variables, etc.

    • Conditioned upon (i.e., accounted for) while modeling

  • \(\beta\)’s are fixed effect parameters

    • intercept and slope parameters
  • \(\epsilon\) is random error

The Linear Model

Linear models are powerful and flexible

  • Some examples of linear models: Splines, analysis of variance (ANOVA), penalized regression (e.g., ridge, lasso), (linear) mixed effects models, and many others

  • Nuances of their utility are often misunderstood

    • Assumptions are made directly on the errors, not the response

The Linear Model

Fit in R using the lm() function

lmod <- lm(formula = y ~ x1 + x2, data = dat)
summary(lmod)

Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.35959 -0.56643 -0.06193  0.48088  1.80592 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.09559    0.18177   0.526    0.606
x1           0.21951    0.20443   1.074    0.298
x2          -0.21386    0.21425  -0.998    0.332

Residual standard error: 0.7943 on 17 degrees of freedom
Multiple R-squared:  0.139, Adjusted R-squared:  0.03776 
F-statistic: 1.373 on 2 and 17 DF,  p-value: 0.2801

Tobler’s Law

When we use a linear model, we assume observations are independent

  • This assumption is typically impractical for spatial data

Tobler’s Law: Nearby observations tend to be more similar than distant observations (Tobler 1970; Miller 2004)

  • Also called the First Law of Geography
  • How can Tobler’s Law affect models?

A Simulation Study

Simulate data from a spatial linear model

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \tau + \epsilon\]

More on the model structure later – for now,

  • \(\tau\) is spatial random error
  • Variable(s) recorded at locations with x/y-coordinates

A Simulation Study

True parameter values of interest

  • \(\beta_1 = 0\)
  • \(\beta_2 = 1\)

Our goal: Simulate a random set of data and estimate \(\beta_1\) and \(\beta_2\) using a nonspatial and a spatial linear model

  • The nonspatial model assumes independence
  • Sometimes “nonspatial” model and “independence” model are used interchangably

The Simulated Response

Figure 1: The spatial distribution of the response variable, y.

Parameter Inference

True value of \(\beta_1 = 0\)

  • Hope \(\hat{\beta}_1\) has a large p-value
Parameter inference for \(\beta_1\) in the nonspatial and spatial linear models.
method est se p.value conf.low conf.high
Nonspatial -0.57 0.05 0.00 -0.68 -0.47
Spatial 0.10 0.07 0.14 -0.03 0.24

Parameter Inference

True value of \(\beta_2 = 1\)

  • Hope \(\hat{\beta}_2\) has a small p-value
Parameter inference for \(\beta_2\) in the nonspatial and spatial linear models.
method est se p.value conf.low conf.high
Nonspatial 0.30 0.59 0.61 -0.85 1.46
Spatial 0.77 0.21 0.00 0.36 1.18

Cross Validation

Figure 2: Observations vs leave-one-out cross validation predictions for the nonspatial and spatial linear models.

Independence Consequences

Fitting nonspatial linear models to spatial data can lead to (Legendre 1993; Zimmerman and Ver Hoef 2024)

  • Increased Type-I error rate for \(\beta\) (false positive)
  • Increased Type-II error rate for \(\beta\) (false negative); decreased power for \(\beta\)
  • Flawed understanding of the ecological system (poor management decisions, etc.)

Fitting spatial linear models to spatial data helps remedy these concerns – but why?

Pseudoreplication

Spatial data share information, implying pseudoreplication (Hurlbert 1984)

  • Less independent pieces of information
  • Spatial models account for this pseudoreplication far more appropriately than nonspatial models
  • Spatial statistics (as a discipline) studies this type of pseudoreplication and its impact on analyses of spatial data

Pseudoreplication also affects other types of models (e.g., machine learning)

Challenges

If spatial models are more effective (than nonspatial models), why have they been used infrequently?

  • Technical challenges: Emerging discipline, theory-heavy
  • Computational challenges: Direct implementation can struggle for \(n > 3{,}000\)
  • Software challenges: Lack of accessible software to fit spatial models

Solutions

Accessibility to spatial models has greatly expanded over the past decade

  • Many developments technically, computationally, and in software
  • Our contribution: The spmodel R package

What is spmodel?

spmodel (Dumelle, Higham, and Ver Hoef 2023) is an R package for spatial statistics designed to:

  • Simplify the transition from nonspatial to spatial linear and generalized linear models
  • Build upon lm() and glm() to account for spatial covariance

Some Recent spmodel Applications

A few examples of recent uses

Some Technical Background

What is a Spatial Process?

Following Cressie (1993), we define a spatial process as

\[ \{y(\mathbf{s}): \mathbf{s} \in D \} \]

\(y(\mathbf{s})\) is the response variable at a location \(\mathbf{s}\) in the spatial domain \(D\)

  • Typically, \(\mathbf{s}\) is two dimensional, composed of an \(x\)-coordinate and a \(y\)-coordinate

Types of Spatial Data

Geostatistical (i.e., point-referenced) data

  • \(y(\mathbf{s})\) is random and \(D\) is fixed and continuous
  • E.g., sulfate deposition at locations in a field

Most ecological field data is geostatistical

  • The standard type of spatial data you collect while field sampling (i.e., observational studies)

Types of Spatial Data

Areal (i.e., lattice, polygon) data

  • \(y(\mathbf{s})\) is random and \(D\) is fixed and discrete
  • E.g., total income for each county in a state

Point pattern data

  • \(D\) is random
  • E.g., location of (mainshock) earthquakes

A Clarification

spmodel provides tools for geostatistical and areal data (not point pattern data)

  • \(y(\mathbf{s})\) is random and \(D\) is fixed

How is distance measured?

  • Geostatistical distance: Euclidean (our primary focus here)
  • Areal distance: Neighborhood structure

Both geostatistical and areal data are unified via the spatial linear and generalized linear model

Another Clarification

For notational simplicity, we henceforth drop the \((\mathbf{s})\) (assume it implicitly)

Then we can write

\[y(\mathbf{s}) = \beta_0 + \beta_1 x_1(\mathbf{s}) + \tau(\mathbf{s}) + \epsilon\]

as \[y = \beta_0 + \beta_1 x_1 + \tau + \epsilon\]

What is Covariance? Correlation?

The covariance between two random variables is a measure of how similar they are:

  • A large positive covariance implies similarity
  • A large negative covariance implies dissimilarity
  • Covariance near zero implies no relationship

The correlation between two random variables is a scaled, unit-less version of covariance:

  • Bounded between \([-1, 1]\)

Visualizing Covariance/Correlation

Figure 3: Positive covariance between two random variables, x and y.
Figure 4: Negative covariance between two random variables, x and y.
Figure 5: No covariance between two random variables, x and y.

Covariance vs Autocovariance vs Dependence?

Covariance (Correlation) vs autocovariance (autocorrelation):

  • Technically, “auto” (i.e., “serial”) means “with itself”
  • Often used interchangeably; we drop the “auto”

The blanket term “dependence” is less precise but almost always means covariance or correlation

What is Spatial Covariance?

What is Spatial Covariance?

We formalize the intuition behind Tobler’s Law via spatial covariance:

  • Spatial covariance is a measure of how similar observations are to one another based on their distance apart (i.e., proximity to one another)
  • Spatial data tend to exhibit positive spatial covariance (nearby observations are more similar than distant ones)

A Few Spatial Processes

A spatial process is typically governed by some parameters that control the strength of spatial covariance

  • The stronger the spatial covariance in the spatial process, the more pronounced the spatial patterning
  • Recall the observed spatial process is a realization (i.e., draw) from a more general random variable

A Few Spatial Processes

A few realizations from a strong and a weak spatial process

Figure 6: A random variable with strong spatial covariance.
Figure 7: A random variable with strong spatial covariance.
Figure 8: A random variable with strong spatial covariance.
Figure 9: A random variable with weak spatial covariance.
Figure 10: A random variable with weak spatial covariance.
Figure 11: A random variable with weak spatial covariance.

Measuring Spatial Covariance

We measure spatial covariance between two observations using a spatial covariance function:

\[\text{Cov}(h) = \sigma^2_{de} f(h, \phi) + \sigma^2_{ie}I\{h = 0\}\]

  • \(h\) is the distance between the two locations

Measuring Spatial Covariance

We measure spatial covariance between two observations using a spatial covariance function:

\[\text{Cov}(h) = \sigma^2_{de} f(h, \phi) + \sigma^2_{ie}I\{h = 0\}\]

  • \(\sigma^2_{de}\) is the spatially dependent variance (i.e., partial sill)
  • \(f(h, \phi)\) is a function of both distance (\(h\)) and a range parameter (\(\phi\))

For the exponential

\[f(h, \phi) = \exp(-h/\phi)\]

For the Gaussian

\[f(h, \phi) = \exp(-h^2/\phi^2)\]

For the spherical

\[f(h, \phi) = (1 - 1.5\frac{h}{\phi} + 0.5 \frac{h^3}{\phi^3})I\{h <= \phi \} \]

Measuring Spatial Covariance

We measure spatial covariance between two observations using a spatial covariance function:

\[\text{Cov}(h) = \sigma^2_{de} f(h, \phi) + \sigma^2_{ie}I\{h = 0\}\]

  • \(\sigma^2_{ie}\) is the independent (i.e., nonspatial) variance (i.e., nugget)
  • \(I\{\cdot\}\) is an indicator function
  • Resampling variability (what if I sampled again tomorrow?)

Measuring Spatial Covariance

We measure spatial covariance between two observations using a spatial covariance function:

\[\text{Cov}(h) = \sigma^2_{de} f(h, \phi) + \sigma^2_{ie}I\{h = 0\}\]

  • The total variance is \(\sigma^2_{de} + \sigma^2_{ie}\) (i.e., the sill)
  • Sometimes nomenclature differs!

Visualizing Spatial Covariance

Figure 12: Examples of several different spatial covariance functions.

Let’s Recap

The takeaways

  • Spatial data tend to be more similar at nearby locations than distant locations (Tobler’s Law)
  • Measure spatial covariance using a spatial covariance function (distance, variance parameters, range parameter)
  • Spatial linear models are more realistic and effective for spatial data (than nonspatial linear models)

Let’s Recap

What’s coming next

  • Describe a linear model for spatial data (i.e., the spatial linear model)
  • Fit spatial linear models using splm()
  • Estimate the spatial covariance of \(y\) given explanatory variables
  • Model inference, diagnostics, comparison
  • Prediction at new locations (i.e., Kriging)

The Spatial Linear Model

The Spatial Linear Model

The spatial linear model

\[ y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \tau + \epsilon \]

  • \(y\) is the response variable
  • \(x\)’s are explanatory variables
  • \(\beta\)’s are fixed effect parameters

The Spatial Linear Model

The spatial linear model

\[ y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \tau + \epsilon \]

\(\tau\) is spatially dependent random error

  • \(\text{Cov}(\tau) = \sigma^2_{de}f(h, \phi)\)

\(\epsilon\) is independent (i.e., nonspatial) random error

  • \(\text{Cov}(\epsilon) = \sigma^2_{ie}I\{h = 0\}\)

Everything we have learned about spatial covariance translates to estimation of the parameters governing \(\tau\) and \(\epsilon\)

The Spatial Linear Model

The inclusion of \(\tau\) makes the model a spatial linear model

  • The structure of each \(x_i\) is irrelevant

  • The following model is a nonspatial linear model

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]

  • The following model is a spatial linear model

\[ y = \beta_0 + \beta_1 x_1 + \tau + \epsilon \]

How are Parameters Estimated?

Several parameter estimation methods (estmethod)

  • Restricted maximum likelihood ("reml"; the default)
  • Maximum likelihood ("ml")
  • Cressie’s semivariogram-based weighted least squares ("sv-wls"; see weights)
  • Curriero and Lele’s semivariogram-based composite likelihood ("sv-cl")

How are Parameters Estimated?

The important point: The explanatory variables are accounted for while estimating the spatial covariance parameters (and vice versa)

See the references for more details

  • "reml": Patterson and Thompson (1971); Harville (1977); Wolfinger, Tobias, and Sall (1994)
  • "ml": Wolfinger, Tobias, and Sall (1994)
  • "sv-wls": Cressie (1985)
  • "sv-cl": Curriero and Lele (1999)

The lake Data

The lake data in spmodel is an sf object (Pebesma 2018) with data from 102 lakes in four southwestern US states (Arizona, Colorado, Nevada, Utah):

lake
Simple feature collection with 102 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -2004016 ymin: 1031593 xmax: -753669.4 ymax: 2338804
Projected CRS: NAD83 / Conus Albers
# A tibble: 102 × 9
   comid    log_cond state  temp precip   elev origin     year 
 * <chr>       <dbl> <chr> <dbl>  <dbl>  <dbl> <chr>      <fct>
 1 20451100     6.32 AZ    12.7   4.94  1567   HUMAN_MADE 2012 
 2 20476542     7.02 AZ    21.8   5.78   459   HUMAN_MADE 2012 
 3 10001770     7.13 AZ    23.2   0.912   69.1 HUMAN_MADE 2012 
 4 20584396     6.17 AZ    11.2   4.44  1822   HUMAN_MADE 2012 
 5 20524727     5.48 AZ     8.31  6.12  2168   HUMAN_MADE 2012 
 6 20479908     7.00 AZ    22.4   2.31   366.  HUMAN_MADE 2012 
 7 10001834     7.85 AZ    23.5   0.989   57.7 NATURAL    2012 
 8 20695686     4.77 AZ     9.13  5.91  2072.  HUMAN_MADE 2012 
 9 21327603     5.30 AZ    17.9   2.83  1006.  HUMAN_MADE 2012 
10 20449310     4.33 AZ     9.79  6.14  1998.  HUMAN_MADE 2012 
# ℹ 92 more rows
# ℹ 1 more variable: geometry <POINT [m]>

Visualizing Log Conductivity

Figure 13: The (natural) logarithm of conductivity at lakes in the southwestern United States.

Fit a Spatial Linear Model

Fit a spatial linear model to study the impact of elevation and lake origin on (log) conductivity while accounting for spatial covariance

spmod <- splm(
  formula = log_cond ~ elev + origin,
  data = lake,
  spcov_type = "exponential"
)
  • If data is not an sf object, provide the xcoord and ycoord arguments to splm()

Summarize the Model

summary(spmod)

Call:
splm(formula = log_cond ~ elev + origin, data = lake, spcov_type = "exponential")

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4556 -0.2682  0.4872  0.8886  3.4064 

Coefficients (fixed):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    8.4214218  0.5716120  14.733   <2e-16 ***
elev          -0.0016519  0.0001284 -12.862   <2e-16 ***
originNATURAL -0.0174953  0.1960494  -0.089    0.929    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared: 0.6312

Coefficients (exponential spatial covariance):
       de        ie     range 
6.101e-01 4.713e-01 8.205e+05 

While informative, this output can be hard to use directly because it is printed to the R console

The broom Functions

The broom R package (Robinson, Hayes, and Couch 2025) converts statistical model output into tidy tibbles:

  • tidy() tidies the model output
  • glance() glances at the model fit
  • augment() augments data with model diagnostics (later, with predictions)

Tidying the Model

tidy(spmod)
# A tibble: 3 × 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    8.42     0.572      14.7      0    
2 elev          -0.00165  0.000128  -12.9      0    
3 originNATURAL -0.0175   0.196      -0.0892   0.929
tidy(spmod, conf.int = TRUE)
# A tibble: 3 × 7
  term          estimate std.error statistic p.value conf.low conf.high
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)    8.42     0.572      14.7      0      7.30      9.54   
2 elev          -0.00165  0.000128  -12.9      0     -0.00190  -0.00140
3 originNATURAL -0.0175   0.196      -0.0892   0.929 -0.402     0.367  

Glancing at the Model

glance(spmod)
# A tibble: 1 × 10
      n     p  npar value   AIC  AICc   BIC logLik deviance pseudo.r.squared
  <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
1   102     3     3  255.  261.  261.  269.  -127.       99            0.631
  • n: Sample size
  • p: Number of (estimated) explanatory variables
  • npar: Number of (estimated) spatial covariance parameters
  • Likelihood-based statistics
  • pseudo.r.squared: Pseudo R-squared, the variability attributable to the explanatory variables

Augmenting the Model

augment(spmod)
Simple feature collection with 102 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -2004016 ymin: 1031593 xmax: -753669.4 ymax: 2338804
Projected CRS: NAD83 / Conus Albers
# A tibble: 102 × 9
   log_cond   elev origin     .fitted .resid   .hat  .cooksd .std.resid
 *    <dbl>  <dbl> <chr>        <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
 1     6.32 1567   HUMAN_MADE    5.83  0.484 0.0104 0.00265       0.872
 2     7.02  459   HUMAN_MADE    7.66 -0.641 0.0356 0.00265      -0.464
 3     7.13   69.1 HUMAN_MADE    8.31 -1.18  0.0556 0.0289       -1.21 
 4     6.17 1822   HUMAN_MADE    5.41  0.756 0.0134 0.00330       0.853
 5     5.48 2168   HUMAN_MADE    4.84  0.636 0.0132 0.00363       0.901
 6     7.00  366.  HUMAN_MADE    7.82 -0.817 0.0364 0.00543      -0.657
 7     7.85   57.7 NATURAL       8.31 -0.457 0.114  0.000708     -0.129
 8     4.77 2072.  HUMAN_MADE    5.00 -0.225 0.0130 0.000491     -0.335
 9     5.30 1006.  HUMAN_MADE    6.76 -1.46  0.0328 0.0361       -1.79 
10     4.33 1998.  HUMAN_MADE    5.12 -0.791 0.0114 0.00475      -1.11 
# ℹ 92 more rows
# ℹ 1 more variable: geometry <POINT [m]>

Augmenting the Model

The components of augment():

  • .fitted: The fitted values, \(X \hat{\beta}\)
  • .resid: The residuals, \(y - X \hat{\beta}\)
  • .hat: The hat (i.e., leverage) values
  • .cooksd: The Cook’s distance
  • .std.resid: The standardized residuals (decorrelated, unit variance)
  • Can also use fitted(), residuals(), hatvalues(), cooks.distance(), rstandard()

Checking Model Assumptions

Use standardized residuals to check model assumptions as in a nonspatial linear model

plot(spmod, which = 1)

Figure 14: Standardized residuals vs fitted values. There is no obvious pattern between the standardized residuals and fitted values.

Model Diagnostics

The fitted spatial covariance (after accounting for the explanatory variables):

\[\text{Cov}(h) = 0.61\exp(-h/820{,}500) + 0.47I\{h = 0 \}\]

Model Diagnostics

The fitted spatial covariance (after accounting for the explanatory variables):

plot(spmod, which = 7)

Figure 15: The fitted spatial covariance of log conductivity (after accounting for the explanatory variables).

Spatial Prediction (i.e., Kriging)

Often, a primary goal is to predict the spatial process at new locations

  • Kriging, best linear unbiased prediction, spatial prediction, prediction

Explanatory variables must also be in the prediction data

The lake_preds Data

Predict log conductivity at the 10 lakes in lake_preds

lake_preds
Simple feature collection with 10 features and 7 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -2026720 ymin: 1249440 xmax: -1095597 ymax: 2078471
Projected CRS: NAD83 / Conus Albers
# A tibble: 10 × 8
   comid    state  temp precip  elev origin     year            geometry
 * <chr>    <chr> <dbl>  <dbl> <dbl> <chr>      <fct>        <POINT [m]>
 1 20438850 AZ    20.7   4.83   533. HUMAN_MADE 2012  (-1428511 1315340)
 2 21411459 AZ    23.4   0.923   47  HUMAN_MADE 2017  (-1707808 1249440)
 3 20595958 NV    10.6   3.72  1604  HUMAN_MADE 2012  (-1581102 1804961)
 4 10693615 NV     6.51  5.18  2256. HUMAN_MADE 2012  (-1795120 2002673)
 5 8942419  NV     7.01  6.65  2124. HUMAN_MADE 2012  (-2026720 2043246)
 6 20704361 UT     5.10  7.22  2569. NATURAL    2012  (-1450440 1770802)
 7 10038630 UT     8.76  2.19  1653. NATURAL    2012  (-1095597 2060788)
 8 10037090 UT     1.90  5.68  2804. NATURAL    2012  (-1175066 2077765)
 9 10389176 UT    11.4   4.09  1291. NATURAL    2012  (-1329835 2078471)
10 4900285  UT     2.28  5.79  3098  NATURAL    2017  (-1337946 1794833)

Visualizing lake_preds

Figure 16: The (natural) logarithm of conductivity at lakes in the southwestern United States alongside log conductivity predictions at new lakes (black triangles).

Spatial Prediction

Can use predict() or augment():

predict(spmod, newdata = lake_preds)
       1        2        3        4        5        6        7        8 
7.057347 7.749696 6.199659 4.841143 4.947082 4.889095 6.075485 4.047063 
       9       10 
7.009061 4.155508 
augment(spmod, newdata = lake_preds)
Simple feature collection with 10 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -2026720 ymin: 1249440 xmax: -1095597 ymax: 2078471
Projected CRS: NAD83 / Conus Albers
# A tibble: 10 × 9
   comid state  temp precip  elev origin year  .fitted           geometry
 * <chr> <chr> <dbl>  <dbl> <dbl> <chr>  <fct>   <dbl>        <POINT [m]>
 1 2043… AZ    20.7   4.83   533. HUMAN… 2012     7.06 (-1428511 1315340)
 2 2141… AZ    23.4   0.923   47  HUMAN… 2017     7.75 (-1707808 1249440)
 3 2059… NV    10.6   3.72  1604  HUMAN… 2012     6.20 (-1581102 1804961)
 4 1069… NV     6.51  5.18  2256. HUMAN… 2012     4.84 (-1795120 2002673)
 5 8942… NV     7.01  6.65  2124. HUMAN… 2012     4.95 (-2026720 2043246)
 6 2070… UT     5.10  7.22  2569. NATUR… 2012     4.89 (-1450440 1770802)
 7 1003… UT     8.76  2.19  1653. NATUR… 2012     6.08 (-1095597 2060788)
 8 1003… UT     1.90  5.68  2804. NATUR… 2012     4.05 (-1175066 2077765)
 9 1038… UT    11.4   4.09  1291. NATUR… 2012     7.01 (-1329835 2078471)
10 4900… UT     2.28  5.79  3098  NATUR… 2017     4.16 (-1337946 1794833)

Spatial Prediction

Combining augment() and sf objects helps visualize predictions spatially:

Figure 17: The (natural) logarithm of conductivity at lakes in the southwestern United States alongside log conductivity predictions (triangles).

Spatial Prediction

Prediction intervals using interval:

predict(spmod, newdata = lake_preds, interval = "prediction")
        fit      lwr      upr
1  7.057347 5.587958 8.526737
2  7.749696 6.184041 9.315351
3  6.199659 4.684288 7.715029
4  4.841143 3.263337 6.418949
5  4.947082 3.346535 6.547628
6  4.889095 3.394423 6.383766
7  6.075485 4.594883 7.556087
8  4.047063 2.618076 5.476051
9  7.009061 5.525522 8.492599
10 4.155508 2.675686 5.635331
augment(spmod, newdata = lake_preds, interval = "prediction")
Simple feature collection with 10 features and 10 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -2026720 ymin: 1249440 xmax: -1095597 ymax: 2078471
Projected CRS: NAD83 / Conus Albers
# A tibble: 10 × 11
   comid    state  temp precip  elev origin     year  .fitted .lower .upper
 * <chr>    <chr> <dbl>  <dbl> <dbl> <chr>      <fct>   <dbl>  <dbl>  <dbl>
 1 20438850 AZ    20.7   4.83   533. HUMAN_MADE 2012     7.06   5.59   8.53
 2 21411459 AZ    23.4   0.923   47  HUMAN_MADE 2017     7.75   6.18   9.32
 3 20595958 NV    10.6   3.72  1604  HUMAN_MADE 2012     6.20   4.68   7.72
 4 10693615 NV     6.51  5.18  2256. HUMAN_MADE 2012     4.84   3.26   6.42
 5 8942419  NV     7.01  6.65  2124. HUMAN_MADE 2012     4.95   3.35   6.55
 6 20704361 UT     5.10  7.22  2569. NATURAL    2012     4.89   3.39   6.38
 7 10038630 UT     8.76  2.19  1653. NATURAL    2012     6.08   4.59   7.56
 8 10037090 UT     1.90  5.68  2804. NATURAL    2012     4.05   2.62   5.48
 9 10389176 UT    11.4   4.09  1291. NATURAL    2012     7.01   5.53   8.49
10 4900285  UT     2.28  5.79  3098  NATURAL    2017     4.16   2.68   5.64
# ℹ 1 more variable: geometry <POINT [m]>

Model Comparison

Motivated spatial linear models via first principles (Tobler’s Law), but are they really a better fit?

  • Fit a nonspatial linear model using splm(..., spcov_type = "none")
lmod <- splm(log_cond ~ elev + origin, data = lake, spcov_type = "none")

Why use splm(..., spcov_type = "none") instead of lm()?

  • To access some spmodel-specific helper functions like loocv()

Model Comparison

The spatial model is strongly preferred via

  1. Likelihood-based statistics (AIC, AICc, BIC):
glances(lmod, spmod)
# A tibble: 2 × 11
  model     n     p  npar value   AIC  AICc   BIC logLik deviance
  <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
1 spmod   102     3     3  255.  261.  261.  269.  -127.       99
2 lmod    102     3     1  270.  272.  272.  274.  -135.       99
# ℹ 1 more variable: pseudo.r.squared <dbl>

Lower values of AIC, AICc, BIC

Model Comparison

The spatial model is strongly preferred via

  1. Leave-one-out cross validation:
loocv(lmod)
# A tibble: 1 × 4
      bias  MSPE RMSPE  cor2
     <dbl> <dbl> <dbl> <dbl>
1 -0.00326 0.714 0.845 0.600
loocv(spmod)
# A tibble: 1 × 4
       bias  MSPE RMSPE  cor2
      <dbl> <dbl> <dbl> <dbl>
1 -0.000703 0.564 0.751 0.684

Unbiased, lower values of MSPE/RMSPE, higher values of predictive R-squared (cor2)

A Cautionary Note

Pseudo R-squared is a useful diagnostic but is not a model selection tool

pseudoR2(lmod)
[1] 0.6237768
pseudoR2(spmod)
[1] 0.6312211

Predictive R-squared (cor2 from loocv()) can be used as a model selection tool

A Cautionary Note

Confusion arises from the R-squared having two meanings simultaneously (when models assume independence)

  1. Variability attributable to the explanatory variables (pseudo R-squared)
  2. Squared correlation between predicted values and data (predictive R-squared)

These definitions are not equivalent when models do not assume independence

Model Comparison

What about comparing different functional forms?

  • Don’t know which to choose? Start with "exponential" and "gaussian"
spmod_list <- splm(
  formula = log_cond ~ elev + origin,
  data = lake,
  spcov_type = c("exponential", "gaussian", "none")
)
glances(spmod_list)
# A tibble: 3 × 11
  model           n     p  npar value   AIC  AICc   BIC logLik deviance
  <chr>       <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
1 gaussian      102     3     3  255.  261.  261.  269.  -127.     99.0
2 exponential   102     3     3  255.  261.  261.  269.  -127.     99  
3 none          102     3     1  270.  272.  272.  274.  -135.     99  
# ℹ 1 more variable: pseudo.r.squared <dbl>

Model Comparison

Don’t stress too much about choosing the best spatial covariance function

  • Typically, the gap between the nonspatial model and the worst spatial model is much larger than the gap between the worst spatial model and the best spatial model
  • First focus on research questions, data quality, omitted explanatory variables, etc.

Let’s Recap

The takeaways

  • Spatial linear models extend linear models to account for spatial covariance
  • Fit spatial linear models using splm() in spmodel
  • The broom functions: tidy(), glance(), and augment()
  • Tools for model inference, diagnostics, prediction, and comparison

Let’s Recap

What’s coming next

  • An overview of additional spmodel features

  • Some closing thoughts

  • If time, focus on

    • applications to large data sets
    • spatial generalized linear models

Some Additional spmodel Features

Some Additional spmodel Features

Additional arguments to splm()

  • anisotropy, random, partition_factor, spcov_initial, estmethod, etc.

Additional modeling tools

  • anova(), emmeans, car, predict(..., block), varcomp(), etc.

Applications to large data sets

  • The local argument

Spatial generalized linear models

  • spglm()

Some Additional spmodel Features

Spatial linear and generalized models for areal (i.e., polygon) data

  • spautor() and spgautor()

Exploratory tools

  • esv(), eacf()

Simulating spatial data

  • sprnorm(), sprbinom(), etc.

Spatial machine learning tools

  • e.g., splmRF()

And more coming!

Closing Thoughts

What’s the Pitch?

What are some reasons to add spmodel to your toolbox?

  • A general tool that extends lm() and glm() for spatial data
  • Many additional features (see previous slide)
  • Integrated with existing R ecosystem (e.g., base, stats, broom, emmeans, car)
  • In active development

Reviewer question: “Did you study the impact of spatial covariance/autocovariance/correlation/autocorrelation on your data?”

  • You can say yes!

Alternatives

What are some reasons to prefer an alternative to spmodel?

Citation Information

If you use spmodel, please do cite it

  • Citing spmodel helps us keep track of uses and justifies future development resources (e.g., bugfixes, updates, etc.)
citation(package = "spmodel")
To cite spmodel in publications use:

  Dumelle M, Higham M, Ver Hoef JM (2023). spmodel: Spatial statistical
  modeling and prediction in R. PLOS ONE 18(3): e0282524.
  https://doi.org/10.1371/journal.pone.0282524

A BibTeX entry for LaTeX users is

  @Article{,
    title = {{spmodel}: Spatial statistical modeling and prediction in {R}},
    author = {Michael Dumelle and Matt Higham and Jay M. {Ver Hoef}},
    journal = {PLOS ONE},
    year = {2023},
    volume = {18},
    number = {3},
    pages = {1--32},
    doi = {10.1371/journal.pone.0282524},
    url = {https://doi.org/10.1371/journal.pone.0282524},
  }

Where to Learn More?

For the most up to date resources, check out our website: https://usepa.github.io/spmodel/

  • Function documentation in “Reference” Tab

  • Vignettes in “Articles” Tab (6+) and links therein (e.g., workshop materials)

What’s Next for spmodel?

What things are we excited about?

  • Paper acceptance: “Spatial generalized linear models in R Using spmodel” at Journal of Statistical Software (expected publication 2026)
  • Textbook (open-source): “A practical introduction to spatial statistics in R” (title subject to change; expected publication 2028)
  • Additional features: Spatio-temporal models, conditional simulation, enhanced machine learning tools, and more

What’s Next for spmodel?

We are always interested in hearing from YOU! We want spmodel to be as useful as possible

  • My email: dumelle.michael@epa.gov (primary); michael.dumelle@oregonstate.edu (secondary)

  • Bug reports and suggestions: https://github.com/USEPA/spmodel/issues

    • A few recent user suggestions: emmeans integration, robust and cloud empirical semivariograms, empirical autocovariance functions
  • Open to potential seminars, workshops, other collaborations, etc. Email me!

Thank You!

A special thank you to

  • ESA Statistical Ecology Section, ESA, and EFI for hosting and the invitation!
  • Developers Matt Higham and Jay M. Ver Hoef
  • USEPA colleagues Darin Kopp, Ryan A. Hill, Amanda M. Nahlik
  • You for attending!

Thank You!

Contact me at

  • dumelle.michael@epa.gov (primary email)
  • michael.dumelle@oregonstate.edu (secondary email)

Questions?

Applications to large data sets? Spatial generalized linear models?

Applications to Large Data Sets

Simulating Some Data

sprnorm() for simulating spatial Gaussian data

set.seed(1)
n <- 2500
xc <- runif(n)
yc <- runif(n)
x <- runif(n)
dat <- data.frame(xc, yc, x)
params <- spcov_params(spcov_type = "exponential", de = 1, ie = 0.2, range = 1)
mu <- 1 + x
dat$y <- sprnorm(params, mean = mu, data = dat, xcoord = xc, ycoord = yc)
n_train <- 2000
train_dat <- dat[seq(1, n_train), ]
test_dat <- dat[-seq(1, n_train), ]
Figure 18: The simulated response variable in the training data.
Figure 19: The simulated explanatory variable in the training data.
Figure 20: The simulated response and explanatory variables in the training data.

Spatial Indexing

The computational challenge for model fitting is inverting an \(n \times n\) matrix

  • Doubling the sample size takes between 4 and 8 times as long (depending on estimation method)

Solution? Split large data set up into smaller data sets and pool results (Ver Hoef et al. 2023; Dumelle, Ver Hoef, et al. 2024)

  • For a review of other approaches to the big data spatial problem, see Heaton et al. (2019)

Assigning SPIN Groups

Two examples of ways to assign observations to groups

  • kmeans clustering:
coords <- data.frame(train_dat$xc, train_dat$yc)
dists <- coords |>
  dist() |>
  as.matrix()
kmeans_output <- kmeans(dists, centers = 4)
train_dat$kmeans <- kmeans_output$cluster |>
  as.factor()
  • random assignment:
index <- rep(1:4, length.out = n_train)
train_dat$random <- index |>
  sample() |>
  as.factor()

Assigning SPIN Groups

Figure 21: Spatial indexing groups chosen via kmeans clustering.
Figure 22: Spatial indexing groups chosen via random assignment.

The local Argument

A logical argument (TRUE/FALSE) or a list to splm() that controls SPIN details:

  • The group assignment method (kmeans, random)
  • The number of groups (or the number of observations in a group)
  • The variance adjustment
  • Parallel processing and cores used

The local Argument

local = TRUE is shorthand for

local_list <- list(
  method = "kmeans",
  size = 100,
  var_adjust = "theoretical",
  parallel = FALSE
)
splm(..., local = local_list)

Model Fitting

How do the fit times compare?

spmod_direct_start <- Sys.time()
spmod_direct <- splm(
  formula = y ~ x,
  data = train_dat,
  spcov_type = "exponential",
  xcoord = xc,
  ycoord = yc
)
spmod_direct_time <- Sys.time() - spmod_direct_start
spmod_bigdata_start <- Sys.time()
spmod_bigdata <- splm(
  formula = y ~ x,
  data = train_dat,
  spcov_type = "exponential",
  xcoord = xc,
  ycoord = yc,
  local = TRUE
)
spmod_bigdata_time <- Sys.time() - spmod_bigdata_start
spmod_direct_time
Time difference of 36.92659 secs
spmod_bigdata_time
Time difference of 1.445589 secs

The performance gap will increase with the size of the observed data, \(n\)

For “lunch-break” computational times:

  • Direct approach limit \(n \approx 10{,}000\)
  • SPIN approach (adjusted variance) limit \(n \approx 250{,}000\)
  • SPIN approach (unadjusted variance) limit \(n \approx 1{,}000{,}000\)
  • Very approximate guidelines

Inference Comparison

tidy(spmod_direct) |>
  filter(term == "x")
# A tibble: 1 × 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 x         1.00    0.0370      27.1       0
tidy(spmod_bigdata) |>
  filter(term == "x")
# A tibble: 1 × 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 x         1.01    0.0374      27.0       0

Cross-Validation Comparison

loocv(spmod_direct)
# A tibble: 1 × 4
      bias  MSPE RMSPE  cor2
     <dbl> <dbl> <dbl> <dbl>
1 0.000205 0.225 0.475 0.622
loocv(spmod_bigdata)
# A tibble: 1 × 4
      bias  MSPE RMSPE  cor2
     <dbl> <dbl> <dbl> <dbl>
1 0.000240 0.225 0.475 0.622

Spatial Covariance Comparison

coef(spmod_direct, type = "spcov")[1:3]
       de        ie     range 
0.5012572 0.1908499 0.4776605 
coef(spmod_bigdata, type = "spcov")[1:3]
       de        ie     range 
0.2980355 0.1890155 0.2508591 
Figure 23: Spatial covariance from the direct model.
Figure 24: Spatial covariance from the spatial indexing (SPIN) model.

Spatial Prediction

The computational challenge for spatial prediction is inverting an \(n \times n\) matrix

  • \(n\) is the size of the observed data

Solution? Only use the \(n_{sub}\) observations closest to the prediction location

The local Argument

A logical argument (TRUE/FALSE) or a list to predict() that controls LNBH details:

  • The neighborhood assignment method (distance, covariance, all)
  • The number of \(n_{sub}\) observations to use for each prediction location
  • Parallel processing and cores used

The local Argument

local = TRUE is shorthand for

local_list <- list(
  method = "covariance",
  size = 100,
  parallel = FALSE
)
predict(object, ..., local = local_list)

Spatial Prediction

How do the LNBH prediction times compare?

preds_direct_start <- Sys.time()
preds_direct <- predict(spmod_direct, newdata = test_dat, se.fit = TRUE)
preds_direct_time <- Sys.time() - preds_direct_start
preds_bigdata_start <- Sys.time()
preds_bigdata <- predict(spmod_bigdata, newdata = test_dat, se.fit = TRUE)
preds_bigdata_time <- Sys.time() - preds_bigdata_start
preds_direct_time
Time difference of 3.684309 secs
preds_bigdata_time
Time difference of 3.499126 secs

The performance gap will increase with the size of the observed data, \(n\)

For “lunch-break” computational times:

  • Direct approach limit \(n \approx 10{,}000\), \(n_{sub} \approx 100{,}000\)
  • LNBH approach limit \(n \approx 100{,}000\), \(n_{sub} \approx 100{,}000\)
  • Parallel processing can speed this up significantly (more benefit than model fitting)
  • Very approximate guidelines

Spatial Prediction

Figure 25: The simulated response variable in the test data.
Figure 26: Predictions for the direct and big data (LNBH) approaches.
Figure 27: Standard Errors for the direct and big data (LNBH) approaches.

Compute the root-mean-squared-prediction error on the test data for the direct and big data (LNBH) approaches

rmspe_direct <- sqrt(mean((test_dat$y - test_dat$direct)^2))
rmspe_direct
[1] 1.442081
rmspe_bigdata <- sqrt(mean((test_dat$y - test_dat$bigdata)^2))
rmspe_bigdata
[1] 1.442876

Let’s Recap

For model fitting, spatial indexing (SPIN) breaks up the full data into subsets and pools the results

  • Provide local to splm()
  • local = TRUE implements defaults

For prediction, the local neighborhood (LNBH) approach uses a subset of the observed data to inform each prediction

  • Provide local to predict() or augment()
  • local = TRUE implements defaults

Spatial Generalized Linear Models

The moose Data

An sf object with data at 218 moose survey locations in the Togiak Region of Alaska, US

moose
Simple feature collection with 218 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 269085 ymin: 1416151 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 218 × 5
    elev strat count presence           geometry
   <dbl> <chr> <dbl> <fct>           <POINT [m]>
 1  469. L         0 0        (293542.6 1541016)
 2  362. L         0 0        (298313.1 1533972)
 3  173. M         0 0        (281896.4 1532516)
 4  280. L         0 0        (298651.3 1530264)
 5  620. L         0 0        (311325.3 1527705)
 6  164. M         0 0        (291421.5 1518398)
 7  164. M         0 0        (287298.3 1518035)
 8  186. L         0 0        (279050.9 1517324)
 9  362. L         0 0        (346145.9 1512479)
10  430. L         0 0        (321354.6 1509966)
# ℹ 208 more rows

Moose Presence

Figure 28: Moose presence in the Togiak Region.

The Need for a GLM

Moose presence is binary, so the errors can’t possibly be (near) Gaussian!

  • For one thing, they are bounded between -1 and 1

The GLM

A generalized linear model (GLM) models, linearly, a function of the mean (\(\text{E}(y)\), or \(\mu\)) of the response

\[ f(\mu) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p\]

  • \(f(\cdot)\) is called a link function

\[ \mu = f^{-1}(\beta_0 + \beta_1 x_1 + ... + \beta_p x_p)\]

  • \(f^{-1}(\cdot)\) is called an inverse link function

The Spatial GLM

The spatial GLM adds back \(\tau\) and \(\epsilon\) to the linear scale

\[ f(\mu) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \tau + \epsilon\]

  • \(\tau\) and \(\epsilon\) influence the response mean
  • \(\text{Cov}(\tau) = \sigma^2_{de}f(h, \phi)\)
  • \(\text{Cov}(\epsilon) = \sigma^2_{ie}I\{h = 0\}\)

Spatial GLMs with spglm()

The spglm() function extends glm() and splm():

  • The family argument describes the response type
spgmod <- spglm(
  formula = presence ~ elev * strat,
  family = binomial,
  data = moose,
  spcov_type = "spherical"
)

For fitting details, see Ver Hoef et al. (2024)

Interpreting Slope Coefficients

In a GLM, slope coefficients are interpreted on the link (here, log odds) scale

tidy(spgmod)
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) -3.04      1.21        -2.52 0.0117 
2 elev         0.00913   0.00413      2.21 0.0269 
3 stratM       3.28      1.16         2.82 0.00483
4 elev:stratM -0.0109    0.00670     -1.62 0.104  

Other broom Functions

The other broom functions work similarly

glance(spgmod)
# A tibble: 1 × 10
      n     p  npar value   AIC  AICc   BIC logLik deviance pseudo.r.squared
  <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>            <dbl>
1   218     4     3  681.  687.  687.  697.  -340.     161.           0.0885
augment(spgmod)
Simple feature collection with 218 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 269085 ymin: 1416151 xmax: 419057.4 ymax: 1541016
Projected CRS: NAD83 / Alaska Albers
# A tibble: 218 × 9
   presence  elev strat .fitted .resid    .hat   .cooksd .std.resid
 * <fct>    <dbl> <chr>   <dbl>  <dbl>   <dbl>     <dbl>      <dbl>
 1 0         469. L      -1.47  -0.644 0.101   0.0130        -0.679
 2 0         362. L      -2.77  -0.349 0.0166  0.000523      -0.352
 3 0         173. M      -2.23  -0.451 0.00390 0.000200      -0.452
 4 0         280. L      -3.59  -0.234 0.00343 0.0000472     -0.234
 5 0         620. L      -0.774 -0.871 0.319   0.130         -1.06 
 6 0         164. M      -2.01  -0.502 0.00459 0.000292      -0.503
 7 0         164. M      -1.74  -0.569 0.00559 0.000458      -0.571
 8 0         186. L      -2.29  -0.440 0.00549 0.000268      -0.441
 9 0         362. L      -1.82  -0.549 0.0358  0.00291       -0.559
10 0         430. L      -1.23  -0.716 0.0898  0.0139        -0.751
# ℹ 208 more rows
# ℹ 1 more variable: geometry <POINT [m]>

Other Fit Metrics

Area under the receiver operator characteristic (AUROC) curve:

AUROC(spgmod)
[1] 0.9490741

Leave-one-out cross validation (on the response scale):

loocv(spgmod)
# A tibble: 1 × 3
       bias  MSPE RMSPE
      <dbl> <dbl> <dbl>
1 -0.000567 0.146 0.382

The moose_preds Data

moose_preds
Simple feature collection with 100 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 269085 ymin: 1416151 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 3
    elev strat           geometry
   <dbl> <chr>        <POINT [m]>
 1  143. L     (401239.6 1436192)
 2  324. L     (352640.6 1490695)
 3  158. L     (360954.9 1491590)
 4  221. M     (291839.8 1466091)
 5  209. M     (310991.9 1441630)
 6  218. L     (304473.8 1512103)
 7  127. L     (339011.1 1459318)
 8  122. L     (342827.3 1463452)
 9  191  L     (284453.8 1502837)
10  105. L     (391343.9 1483791)
# ℹ 90 more rows

The moose_preds Data

Figure 29: Moose observed data and prediction locations.

Spatial Prediction

Predicting the underlying mean probability of moose presence (not the response)

preds <- predict(spgmod, newdata = moose_preds, type = "response")
head(preds)
        1         2         3         4         5         6 
0.5214583 0.3994820 0.1329517 0.2420033 0.8110757 0.0604971 
augment(spgmod, newdata = moose_preds, interval = "prediction")
Simple feature collection with 100 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 6
    elev strat .fitted .lower  .upper           geometry
 * <dbl> <chr>   <dbl>  <dbl>   <dbl>        <POINT [m]>
 1  143. L      0.0859 -2.22   2.39   (401239.6 1436192)
 2  324. L     -0.408  -3.37   2.56   (352640.6 1490695)
 3  158. L     -1.88   -4.64   0.889  (360954.9 1491590)
 4  221. M     -1.14   -3.62   1.34   (291839.8 1466091)
 5  209. M      1.46   -0.898  3.81   (310991.9 1441630)
 6  218. L     -2.74   -5.62   0.138  (304473.8 1512103)
 7  127. L     -3.24   -6.09  -0.386  (339011.1 1459318)
 8  122. L     -2.81   -5.54  -0.0828 (342827.3 1463452)
 9  191  L     -0.770  -3.95   2.41   (284453.8 1502837)
10  105. L     -1.31   -3.70   1.07   (391343.9 1483791)
# ℹ 90 more rows

Spatial Prediction

Figure 30: Fitted values and predictions for moose presence probability.

Other GLM Families

Generalized linear model response variable types, families, and link functions supported by spmodel.
type family link
binary binomial logit
count poisson log
count nbinomial log
proportion beta logit
skewed Gamma log
skewed inverse.gaussian log

Let’s Recap

Spatial generalized models are useful for highly non-Gaussian spatial data

  • Response types include binary, count, proportion, and skewed data

spglm() for model fitting and predict() (or augment()) for prediction

  • spglm() builds upon glm() to account for spatial covariance
  • Helper functions (e.g., summary()) defined for splm() objects are also defined for spglm() objects

Session Information

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tigris_2.2.1     spmodel_0.11.1   sf_1.0-21        kableExtra_1.4.0
 [5] broom_1.0.10     lubridate_1.9.4  forcats_1.0.1    stringr_1.5.2   
 [9] dplyr_1.1.4      purrr_1.1.0      readr_2.1.5      tidyr_1.3.1     
[13] tibble_3.3.0     ggplot2_4.0.0    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       xfun_0.52          lattice_0.22-6     tzdb_0.4.0        
 [5] vctrs_0.6.5        tools_4.4.1        generics_0.1.4     curl_6.2.2        
 [9] parallel_4.4.1     proxy_0.4-27       pkgconfig_2.0.3    Matrix_1.7-0      
[13] KernSmooth_2.23-24 RColorBrewer_1.1-3 S7_0.2.0           uuid_1.2-0        
[17] lifecycle_1.0.4    compiler_4.4.1     farver_2.1.2       htmltools_0.5.8.1 
[21] class_7.3-22       yaml_2.3.10        pillar_1.11.1      classInt_0.4-11   
[25] nlme_3.1-164       tidyselect_1.2.1   digest_0.6.36      mvtnorm_1.2-5     
[29] stringi_1.8.4      splines_4.4.1      labeling_0.4.3     fastmap_1.2.0     
[33] grid_4.4.1         cli_3.6.3          magrittr_2.0.4     utf8_1.2.4        
[37] e1071_1.7-16       withr_3.0.2        rappdirs_0.3.3     scales_1.4.0      
[41] backports_1.5.0    timechange_0.3.0   estimability_1.5.1 httr_1.4.7        
[45] rmarkdown_2.28     emmeans_1.10.3     hms_1.1.3          coda_0.19-4.1     
[49] evaluate_0.24.0    knitr_1.50         viridisLite_0.4.2  mgcv_1.9-1        
[53] rlang_1.1.6        Rcpp_1.0.12        xtable_1.8-4       glue_1.7.0        
[57] DBI_1.2.3          xml2_1.4.0         pROC_1.18.5        svglite_2.1.3     
[61] rstudioapi_0.16.0  jsonlite_2.0.0     plyr_1.8.9         R6_2.5.1          
[65] systemfonts_1.3.1  units_0.8-7       

Disclaimer

Disclaimer

The views expressed in this workshop are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency or the U.S. National Oceanic and Atmospheric Administration. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government, the U.S. Environmental Protection Agency, or the U.S. National Oceanic and Atmospheric Administration. The U.S. Environmental Protection Agency and the U.S. National Oceanic and Atmospheric Administration do not endorse any commercial products, services, or enterprises.

References

Andrew O. Finley, Sudipto Banerjee, and Bradley P. Carlin. 2007. spBayes: An R Package for Univariate and Multivariate Hierarchical Point-Referenced Spatial Models.” Journal of Statistical Software 19 (4): 1–24. https://www.jstatsoft.org/article/view/v019i04.
Aranda, Melina Jeanette, Marco Conedera, Gianni Boris Pezzatti, and Eric Gehring. 2024. “Landscape, Site and Post-Disturbance Forest Stand Characteristics Modulate the Colonisation of Non-Native Invasive Woody Species.” Forest Ecology and Management 565: 122017.
Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press. https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/p/book/9781482210200/.
Brehob, Meredith M, Michael J Pennino, Jana E Compton, Qian Zhang, Marc H Weber, Ryan A Hill, Selia Markley, et al. 2025. “The US EPA’s National Nutrient Inventory: Critical Shifts in US Nutrient Pollution Sources from 1987 to 2017.” Environmental Science & Technology.
Cressie, Noel. 1985. “Fitting Variogram Models by Weighted Least Squares.” Journal of the International Association for Mathematical Geology 17 (5): 563–86.
———. 1993. Statistics for Spatial Data (Revised Edition). Wiley: Hoboken, NJ. https://doi.org/10.1002/9781119115151.
Curriero, Frank C, and Subhash Lele. 1999. “A Composite Likelihood Approach to Semivariogram Estimation.” Journal of Agricultural, Biological, and Environmental Statistics, 9–28.
Datta, Abhirup, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. 2016. “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.” Journal of the American Statistical Association 111 (514): 800–812.
Doser, Jeffrey W., Andrew O. Finley, Marc Kéry, and Elise F. Zipkin. 2022. spOccupancy: An r Package for Single-Species, Multi-Species, and Integrated Spatial Occupancy Models.” Methods in Ecology and Evolution 13: 1670–78. https://doi.org/10.1111/2041-210X.13897.
———. 2024. spAbundance: An R package for univariate and multivariate spatially-explicit abundance models.” Methods in Ecology and Evolution 15: 1024–33. https://doi.org/10.1111/2041-210X.14332.
Doser, Jeffrey W, Andrew O Finley, and Sudipto Banerjee. 2023. “Joint Species Distribution Models with Imperfect Detection for High-Dimensional Spatial Data.” Ecology 104 (9): e4137.
Dumelle, Michael, Matt Higham, and Jay M. Ver Hoef. 2023. spmodel: Spatial Statistical Modeling and Prediction in R.” PLOS ONE 18 (3): e0282524. https://doi.org/10.1371/journal.pone.0282524.
Dumelle, Michael, Erin E. Peterson, Jay M. Ver Hoef, Alan Pearse, and Daniel J. Isaak. 2024. SSN2: The Next Generation of Spatial Stream Network Modeling in R.” Journal of Open Source Software 9 (99): 6389. https://doi.org/10.21105/joss.06389.
Dumelle, Michael, Jay M. Ver Hoef, Amalia Handler, Ryan A Hill, Matt Higham, and Anthony R. Olsen. 2024. “Modeling Lake Conductivity in the Conterminous United States Using Spatial Indexing for Large Spatial Data.” Spatial Statistics 59: 100808.
Ecker, Mark D, John P DeGroote, Clemir A Coproski, Bingqing Liang, John Darko, and James T Dietrich. 2025. “Urban Heat Mapping Strategies for Predicting Near-Surface Air Temperature in Unsampled Cities in Iowa.” Sustainability 17 (9): 3914.
Grossenbacher, Dena L, Magdalene S Lo, Molly E Waddington, Ryan O’Dell, and Kathleen M Kay. 2025. “Soil and Climate Contribute to Maintenance of a Flower Color Polymorphism.” American Journal of Botany, e70018.
Harville, David A. 1977. “Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems.” Journal of the American Statistical Association 72 (358): 320–38.
Heaton, Matthew J, Abhirup Datta, Andrew O Finley, Reinhard Furrer, Joseph Guinness, Rajarshi Guhaniyogi, Florian Gerber, et al. 2019. “A Case Study Competition Among Methods for Analyzing Large Spatial Data.” Journal of Agricultural, Biological and Environmental Statistics 24 (3): 398–425.
Hurlbert, Stuart H. 1984. “Pseudoreplication and the Design of Ecological Field Experiments.” Ecological Monographs 54 (2): 187–211.
Johnson, Marie, Ashley Ballantyne, Jon Graham, Zachary Holden, Zachary Hoylman, Kelsey Jensco, David Ketchum, John Kimball, and Jessica Mitchell. 2025. “An Ecosystem Resilience Index That Integrates Measures of Vegetation Function, Structure, and Composition.” Ecological Indicators 171: 113076.
Legendre, Pierre. 1993. “Spatial Autocorrelation: Trouble or New Paradigm?” Ecology 74 (6): 1659–73.
Lenth, Russell V. 2024. Emmeans: Estimated Marginal Means, Aka Least-Squares Means. https://CRAN.R-project.org/package=emmeans.
Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with r-INLA.” Journal of Statistical Software 63: 1–25.
Miller, Harvey J. 2004. “Tobler’s First Law and Spatial Analysis.” Annals of the Association of American Geographers 94 (2): 284–89.
Nahlik, Amanda M, Steven G Paulsen, Michael Dumelle, Susan Holdsworth, Sarah Lehmann, Nicolle S Tulve, Sean J Paul, and H Christopher Frey. 2025. “National Aquatic Resource Surveys (NARS): The Foundation for Long-Term Aquatic Monitoring Data Across the United States.” Environmental Monitoring and Assessment 197 (12): 1291.
Oksanen, Jari, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O’Hara, et al. 2024. Vegan: Community Ecology Package. https://CRAN.R-project.org/package=vegan.
Patterson, Desmond, and Robin Thompson. 1971. “Recovery of Inter-Block Information When Block Sizes Are Unequal.” Biometrika 58 (3): 545–54.
Pebesma, Edzer. 2018. Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Robinson, David, Alex Hayes, and Simon Couch. 2025. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
Rumschlag, Samantha L, Brian Gallagher, Ryan Hill, Ralf B Schäfer, Travis S Schmidt, Taylor Woods, Darin Kopp, et al. 2025. “Diverging Fish Biodiversity Trends in Cold and Warm Rivers and Streams.” Nature, 1–7.
Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (sup1): 234–40.
Ver Hoef, Jay M, Eryn Blagg, Michael Dumelle, Philip M Dixon, Dale L Zimmerman, and Paul B Conn. 2024. “Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation.” Environmetrics, e2872.
Ver Hoef, Jay M, Michael Dumelle, Matt Higham, Erin E Peterson, and Daniel J Isaak. 2023. “Indexing and Partitioning the Spatial Linear Model for Large Data Sets.” PlOS ONE 18 (11): e0291906.
Wolfinger, Russ, Randy Tobias, and John Sall. 1994. “Computing Gaussian Likelihoods and Their Derivatives for General Linear Mixed Models.” SIAM Journal on Scientific Computing 15 (6): 1294–1310.
Zimmerman, Dale L, and Jay M Ver Hoef. 2024. Spatial Linear Models for Environmental Data. Chapman; Hall/CRC.